Reliability estimator for ad hoc applications

ABSTRACT

In certain embodiments, a computer-implemented method includes receiving a request for a reliability estimate associated with an ad hoc application. In response to the request, one or more components associated with the ad hoc application and upon which the ad hoc application relies are identified. The method also includes generating a directed graph. The directed graph identifies one or more dependency relationships among the identified components. The method also includes calculating, based at least in part on the directed graph, a reliability estimate for the ad hoc application.

CROSS REFERENCE TO RELATED APPLICATION

This application, U.S. patent application Ser. No. 15/449,814, along with U.S. patent application Ser. No. 15/466,626, filed on Mar. 22, 2017, are both reissue applications of U.S. patent application Ser. No. 13/223,972, filed Sep. 1, 2011, now U.S. Pat. No. 8,972,564, entitled “RELIABILITY ESTIMATOR FOR AD HOC APPLICATIONS.”

BACKGROUND

Reliability is an important business property. Reliability, however, can be difficult to measure in a distributed system comprising many disparate components with differing levels of availability and redundancy. This is particularly true when portions of the service infrastructure are purchased from another company, which may not reveal details of its internal infrastructure. Formal models, end-to-end system descriptions, and simple, uncorrelated modes of failure may be inadequate in more complicated systems in which internal components are obscured from a user.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present disclosure and its advantages, reference is made to the following descriptions, taken in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates an example system for estimating reliability of an ad hoc application, according to certain embodiments of the present disclosure;

FIG. 2 is a block diagram illustrating an example process for calculating a reliability estimate that may be performed by the example system of FIG. 1, according to certain embodiments of the present disclosure;

FIG. 3 illustrates an example application definition including a primary resource and two secondary resources, according to certain embodiments of the present disclosure;

FIG. 4 illustrates an example application definition expanded to include several application components found using tag associations, according to certain embodiments of the present disclosure;

FIG. 5 illustrates an example directed graph constructed by the example system of FIG. 1 in which the example application definition of FIG. 4 is expanded to include several application and infrastructure components found using allocation and dependency relationships;

FIG. 6 illustrates conditional probability tables 600a-c for application and infrastructure components included in the example directed graph illustrated in FIG. 5, according to certain embodiments of the present disclosure;

FIG. 7 illustrates an example table that includes the results of an example series of trials performed by the example system of FIG. 1 to calculate a reliability estimate, in accordance with particular embodiments of the present disclosure; and

FIG. 8 illustrates an example computer system that may be used for one or more portions of the example system of FIG. 1, according to certain embodiments of the present disclosure.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

Cloud providers deliver a set of services that can be used to construct applications in a reliable, scalable, and inexpensive manner. These benefits, however, should be obtained by using the services in a careful manner. While some properties such as cost are relatively easy to measure, other properties such as reliability are not. Past solutions to measure reliability of interconnected components, systems, and/or applications have included manual efforts to calculate application reliability from fault trees, reliability block diagrams, and other modeling approaches. These calculations often require access to exact network schematics and aggregate reliability data, which may be highly confidential and proprietary business information. Alternatively, efforts to calculate application reliability have treated these factors as black boxes, which limits the estimation of reliability to coarse-grain measures. These methods often incorrectly assume that all failures are independent and that the rate at which failures occur is constant.

Particular embodiments of the present disclosure address these and other limitations of previous systems by incorporating user input of an application definition and relationships between computing resources to determine an infrastructure and application configuration. Based on historical availability of the infrastructure and application resources, conditional probability tables are generated that indicate the availability of infrastructure and application components under various circumstances (such as, e.g., whether directly relied upon components are available or not available). A reliability estimate is generated by running a large number of successive trials in which the availability or non-availability of an infrastructure or application component is determined in accordance with the statistical probabilities indicated in the generated conditional probability tables. The reliability estimate may then be based on the aggregate number of times an ad hoc application is determined to be available or not available over the total number of trials. The reliability estimate may be transmitted to a user of an ad hoc application.

FIG. 1 illustrates an example system 100 for a reliability estimator for ad hoc applications, according to certain embodiments of the present disclosure. In the illustrated example, system 100 includes a user system 102, a network 104, a server system 106, a storage module 108, and one or more computing resources 110. Although system 100 is illustrated and primarily described as including particular components, the present disclosure contemplates system 100 including any suitable components, according to particular needs.

In general, portions of system 100 provide an environment in which one or more computing resources (e.g., computing resources 110) are made available over a communication network (e.g., network 104) to one or more remote computer systems, such as user system 102. In certain embodiments, server system 106, storage module 108, and computing resources 110 may be communicatively coupled together over a high speed communication network and collectively may comprise a computing infrastructure, which may be referred to as a provisioned computing resources environment 112. User system 102 and/or network 104 may be external to provisioned computing resources environment 112 and may be referred to as an external computing environment 114.

In certain embodiments, provisioned computing resources environment 112 (including, for example, one or more of server system 106, storage module 108, and computing resources 110) may provide a collection of remote computing services offered over a network (which may or may not be network 104). Those computing services may include, for example, storage, computer processing, networking, applications, or any other suitable computing resources that may be made available over a network. In some embodiments, computing resources may be referred to as ad hoc applications, which may be provisioned or de-provisioned according to the requirements and/or configuration of external computing environment 114. In certain embodiments, entities accessing those computing services may gain access to a suite of elastic information technology (IT) infrastructure services (e.g., computing resources 110) as the entity requests those services. Provisioned computing resources environment 112 may provide a scalable, reliable, and secure distributed computing infrastructure.

In association with making those computing resources 110 available over the network (e.g., provisioning the computing resources 110), a variety of reliability parameters may be generated. These reliability parameters may indicate or represent the availability or non-availability of a particular provisioned ad hoc application (or its underlying infrastructure or application components) to user system 102 or external computing environment 114. Reliability parameters may be referred to as reliability metrics data. Server system 106 uses reliability metrics data to determine a reliability estimate for one or more ad hoc applications. Reliability metrics data may be associated with a particular component, system, software, application, interface, and/or network included in provisioned computing resources environment 112. Particular examples of reliability metrics data may include user reliability data 124, instance reliability data 126, and class reliability data 128, discussed further below.

Portions of system 100 may determine reliability metrics data associated with components of system 100 (e.g., computing resources 110). It may be appropriate to communicate a portion or all of this reliability metrics data over a network (e.g., network 104) to a server so that the server (e.g., server system 106) may use the communicated reliability metrics data. For example, reliability metrics data may be communicated over a network (e.g., network 104) to a server (e.g., server system 106), so that server system 106 may calculate reliability estimate 134 for one or more ad hoc applications. A particular reliability estimate 134 may be communicated over network 104 to user system 102 in response to a query for reliability data associated with a particular ad hoc application.

User system 102 may include one or more computer systems at one or more locations. Each computer system may include any appropriate input devices, output devices, mass storage media, processors, memory, or other suitable components for receiving, processing, storing, and communicating data. For example, each computer system may include a personal computer, workstation, network computer, kiosk, wireless data port, personal data assistant (PDA), one or more Internet Protocol (IP) telephones, smart phones, tablet computers, one or more servers, a server pool, one or more processors within these or other devices, or any other suitable processing device. User system 102 may be a stand-alone computer or may be a part of a larger network of computers associated with an entity.

User system 102 may include processing unit 116 and memory unit 118. Processing unit 116 may include one or more microprocessors, controllers, or any other suitable computing devices or resources. Processing unit 116 may work, either alone or with other components of system 100, to provide a portion or all of the functionality of system 100 described herein. Memory unit 118 may take the form of volatile or non-volatile memory including, without limitation, magnetic media, optical media, RAM, ROM, removable media, or any other suitable memory component.

In general, user system 102 communicates tag information 138 and application definition 140 to server system 106 to facilitate reliability estimation for an ad hoc application. First, user system 102 may interact with component tagging module 142 to apply one or more metadata tags (e.g., tag information 138) to computing resources 110. A metadata tag may be a short, textual string that describes one or more aspects of the relevant computing resource 110. For example, if user system 102 is provisioned with an ad hoc application (e.g., an accounting software package) that runs on two processing computing resources 110 and one database computing resource 110, the user may tag each of the computing resources 110 with the string ‘accounting’ to associate the computing resources with the provisioned ad hoc application. Tag information 138 may also describe configuration relations. For example, tag information 138 may link resources with resource addresses, access control policies, firewall rules, or connection strings. In general, tag information 138 includes metadata information that associates a particular computing resource 110 with an ad hoc application provided to user system 102.
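
As a non-limiting sketch (the data structure and resource names below are hypothetical, not prescribed by the disclosure), tag information 138 could be kept as a mapping from resource identifiers to sets of tag strings:

    # Hypothetical representation of tag information 138: resource -> tags.
    tag_store = {}

    def apply_tag(resource_id, tag):
        """Associate a metadata tag string with a computing resource 110."""
        tag_store.setdefault(resource_id, set()).add(tag)

    # A user tags two processing resources and one database resource with
    # 'accounting' to associate them with the provisioned ad hoc application.
    for resource_id in ("proc-1", "proc-2", "db-1"):
        apply_tag(resource_id, "accounting")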

Second, user system 102 may interact with application definition module 144 to create an application definition (e.g., application definition 140) of a provisioned ad hoc application. Application definition 140 includes at least a primary computing resource 110 for which reliability estimate 134 is to be calculated. Application definition 140 may include one or more secondary computing resources 110 that are supportive of the primary computing resource 110. For example, the primary computing resource 110 may be a software service while a secondary computing resource 110 may be a web service accessed by the software service. In some embodiments, application definition 140 may not define all secondary computing resources 110 used by a particular ad hoc application. Graph inference module 146 may expand the user-provided ad hoc application definition 140 into a more comprehensive application definition. In some embodiments, application definition module 144 defines the starting seeds for graph inference module 146. Graph inference module 146 is discussed in greater detail below with respect to FIGS. 2 and 5.
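
One possible representation of application definition 140, sketched with hypothetical names, is the primary computing resource 110 together with the secondary computing resources 110 it depends upon; these entries serve as the starting seeds for graph inference module 146:

    # Illustrative sketch of application definition 140 (names hypothetical).
    application_definition = {
        "primary": "Application",                 # resource whose reliability is estimated
        "secondary": ["Service 1", "Service 2"],  # resources the primary depends upon
    }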

A user of user system 102 may include, for example, a person capable of requesting and receiving a reliability estimate for an ad hoc application. As a more particular example, a user of system 102 may be associated with an entity using the computing resources (e.g., computing resources 110) made available over a network.

Network 104 facilitates wireless or wireline communication. Network 104 may communicate, for example, IP packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, and other suitable information between network addresses. Network 104 may include one or more local area networks (LANs), radio access networks (RANs), metropolitan area networks (MANs), wide area networks (WANs), mobile networks (e.g., using WiMax (802.16), WiFi (802.11), 3G, or any other suitable wireless technologies in any suitable combination), all or a portion of the global computer network known as the Internet, and/or any other communication system or systems at one or more locations, any of which may be any suitable combination of wireless and wireline.

Server system 106 may include one or more computer systems at one or more locations. Each computer system may include any appropriate input devices, output devices, mass storage media, processors, memory, or other suitable components for receiving, processing, storing, and communicating data. For example, each computer system may include a personal computer, workstation, network computer, kiosk, wireless data port, PDA, one or more IP telephones, one or more servers, a server pool, one or more processors within these or other devices, or any other suitable processing device. Server system 106 may be a stand-alone computer or may be a part of a larger network of computers associated with an entity.

Server system 106 may include processing unit 120 and memory unit 122. Processing unit 120 may include one or more microprocessors, controllers, or any other suitable computing devices or resources. Processing unit 120 may work, either alone or with other components of system 100, to provide a portion or all of the functionality of system 100 described herein. Memory unit 122 may take the form of volatile or non-volatile memory including, without limitation, magnetic media, optical media, RAM, ROM, removable media, or any other suitable memory component.

Server system 106 may calculate reliability estimate 134 for one or more ad hoc applications. In particular, server system 106 calculates reliability estimate 134 based on data received from or determined in conjunction with other components of system 100. More specifically, server system 106 may calculate reliability estimate 134 based on one or more of user reliability data 124, instance reliability data 126, class reliability data 128, infrastructure repository 136, tag information 138, and application definition 140. As described further below, server system 106 may process this received data using one or more of component tagging module 142, application definition module 144, graph inference module 146, application probability calculator 148, infrastructure probability calculator 150, and reliability estimator module 152 to calculate reliability estimate 134.

User reliability data 124 represents historical availability or non-availability of a computing resource 110 (e.g., an ad hoc application) as determined by direct observation of user system 102. For example, user system 102 may periodically perform a health check on an ad hoc application to determine whether the ad hoc application is operational. User system 102 may communicate the results of the health check to server system 106, which may store the results as user reliability data 124.

Instance reliability data 126 represents the historical availability or non-availability of a particular computing resource 110 (e.g., a server, disk drive, network interface, power supply, etc.). For example, one or more components of system 100 (e.g., server system 106) may periodically perform a health check of infrastructure components to determine their respective availability or non-availability to user system 102. Server system 106 stores the results of the health check as instance reliability data 126.

Class reliability data 128 represents the historical availability or non-availability of a particular class of computing resources 110. For example, one or more components of system 100 (e.g., server system 106) may periodically perform a health check of one or more similar computing resources 110 to determine their availability or non-availability as a class. Class reliability data 128 may bias towards components with measures of similarity, such as hardware revision, order date, time in service, installation location, or maintenance record. In some embodiments, class reliability data 128 may be used as a proxy for instance reliability data 126 if or when instance reliability data 126 is unavailable for a particular computing resource 110.
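
A minimal sketch of this proxy behavior, assuming hypothetical lookup tables standing in for instance reliability data 126 and class reliability data 128:

    # Hypothetical availability observations (fractions of time available).
    instance_reliability = {"server-42": 0.998}           # instance reliability data 126
    class_reliability = {"server": 0.995, "disk": 0.990}  # class reliability data 128

    def component_availability(component_id, component_class):
        """Prefer direct observation; fall back to class (fleet) statistics."""
        if component_id in instance_reliability:
            return instance_reliability[component_id]
        return class_reliability[component_class]

    print(component_availability("server-42", "server"))  # 0.998 (direct observation)
    print(component_availability("server-77", "server"))  # 0.995 (class proxy)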

Infrastructure repository 136 stores information related to computing resources 110. For example, server system 106 may store a hardware type, hardware parameters (e.g., processor speed, storage space, etc.), hardware revision, order date, time in service, installation location, or maintenance record for each computing resource 110 in infrastructure repository 136. Infrastructure repository 136 may additionally store details regarding the connections between computing resources 110, such as network links, network speeds, network availability, and/or connection type. In some embodiments, server system 106 may store information related to computing resources 110 in a database on storage module 108.

Component tagging module 142 receives tag information 138 from user system 102 and stores tag information 138 in storage module 108. As discussed above, tag information 138 indicates relationships between computing resources 110 and an ad hoc application.

Application definition module 144 receives application definition 140 from user system 102 and stores application definition 140 in storage module 108. Application definition 140 identifies one or more component computing resources 110 for an ad hoc application.

Graph inference module 146 constructs a directed graph (e.g., directed graph 500 illustrated in FIG. 5) including application components and one or more infrastructure components. Directed graph 500 may be constructed using data from tag information 138, application definition 140, and/or infrastructure repository 136. Graph inference module 146 may determine relationships between computing resources 110. For example, graph inference module 146 may determine that a particular computing resource 110 relies on another computing resource 110 in order to operate or be available to a user at user system 102. Graph inference module 146 organizes these relationships and constructs directed graph 500.

Application probability calculator 148 constructs conditional probability tables 600 for the directed graph based on user reliability data 124 and instance reliability data 126 for components of an ad hoc application. Example conditional probability tables 600 generated by application probability calculator 148 are discussed further below with respect to FIG. 6.

Infrastructure probability calculator 150 constructs conditional probability tables 600 for the directed graph based on instance reliability data 126 and class reliability data 128 for infrastructure components relied on by an ad hoc application. For an infrastructure component, infrastructure probability calculator 150 may access databases for instance reliability data and class reliability data to construct a historical availability record for the component. Example conditional probability tables 600 generated by infrastructure probability calculator 150 are discussed further below with respect to FIG. 6.

Reliability estimator module 152 calculates reliability estimate 134 based on conditional probability tables 600 and directed graph 500. Reliability estimator module 152 may evaluate the inferred directed graph and constructed conditional probability tables 600 as a Bayesian network to produce reliability estimate 134. Exact computation of the reliability of the primary resource may be possible for simple directed graphs, such as graphs with only a single path to any component. In many cases, however, the inferred directed graph may not have a direct solution. In some embodiments, reliability estimator module 152 supports stochastic simulation of the inferred directed graph to compute the reliability of the primary resource. For example, reliability estimator module 152 may run a number of trials sampling different availability configurations according to the conditional probabilities for availability of each component in the directed graph.

Computing resources 110 may include any suitable computing resources that may be made available over a network (which may or may not be network 104). Computing resources 110 may include any suitable combination of hardware, firmware, and software. As just a few examples, computing resources 110 may include any suitable combination of applications, power, processors, storage, and any other suitable computing resources that may be made available over a network. Computing resources 110 may each be substantially similar to one another or may be heterogeneous. As described above, entities accessing computing services provided by the provisioned computing resources environment may gain access to a suite of elastic IT infrastructure services (e.g., computing resources 110) as the entity requests those services. Provisioned computing resources environment 112 may provide a scalable, reliable, and secure distributed computing infrastructure.

In the illustrated example, each computing resource 110 comprises processing unit 130 and memory unit 132. Processing unit 130 may include one or more microprocessors, controllers, or any other suitable computing devices or resources. Processing unit 130 may work, either alone or with other components of system 100, to provide a portion or all of the functionality of system 100 described herein. Memory unit 132 may take the form of volatile or non-volatile memory including, without limitation, magnetic media, optical media, RAM, ROM, removable media, or any other suitable memory component. In certain embodiments, a portion or all of memory unit 132 may include a database, such as one or more structured query language (SQL) servers or relational databases. Although FIG. 1 illustrates examples of computing resources 110 that include processing unit 130 and memory unit 132, particular embodiments may include one or more computing resources 110 that represent computing resources, components, applications, and/or infrastructure that do not include processing unit 130 and memory unit 132.

Server system 106 may be coupled or otherwise associated with a storage module 108. Storage module 108 may take the form of volatile or non-volatile memory including, without limitation, magnetic media, optical media, RAM, ROM, removable media, or any other suitable memory component. In certain embodiments, a portion or all of storage module 108 may include a database, such as one or more SQL servers or relational databases. Storage module 108 may be a part of or distinct from memory unit 122 of server system 106.

Storage module 108 may store a variety of information and applications that may be used by server system 106 or other suitable components of system 100. In the illustrated example, storage module 108 may store user reliability data 124, instance reliability data 126, class reliability data 128, and infrastructure repository 136. Although storage module 108 is described as including particular information and applications, storage module 108 may store any other suitable information and applications. Furthermore, although this particular information and these applications are described as being stored in storage module 108, the present description contemplates storing this particular information and these applications in any suitable location, according to particular needs.

System 100 provides just one example of an environment in which the reliability estimation for ad hoc applications technique of the present disclosure may be used. The present disclosure contemplates use of the reliability estimation technique in any suitable computing environment. Additionally, although functionality is described as being performed by certain components of system 100, the present disclosure contemplates other components performing that functionality. As just one example, functionality described with reference to server system 106 may be performed by one or more components of computing resources 110 and/or user system 102. Furthermore, although certain components are illustrated as being combined or separate, the present disclosure contemplates separating and/or combining components of system 100 in any suitable manner. As just one example, server system 106 and one or more of computing resources 110 may be combined in a suitable manner.

Certain embodiments of the present disclosure may provide some, none, or all of the following technical advantages. For example, certain embodiments provide a reliability estimate for computing resources based on a user identification of key components and associations. Receiving a user's indication of certain component relationships may allow providers to generate a reliability estimate for ad hoc applications without having to disclose infrastructure, network, and computing resource details to a user of the ad hoc application. As a result, particular embodiments of the present disclosure may provide a reliability estimate in a distributed system comprising many disparate components with differing levels of availability and redundancy. Thus, providers of ad hoc applications may satisfy user demand for reliability estimates without having to reveal the details of the provisioned system. Accordingly, having a quantifiable measure of reliability for an application increases trust and lessens the risk of using a cloud provider or ad hoc applications.

FIG. 2 is a block diagram illustrating an example process for calculating reliability estimate 134 that may be performed by the example system 100 of FIG. 1. In operation of an example embodiment of system 100, a user at user system 102 interacts with component tagging module 142 to apply one or more metadata tags (e.g., tag information 138) to computing resources 110, as represented by arrow 201. For example, if user system 102 is provisioned with an ad hoc application (e.g., an accounting software package) that runs on two processing computing resources 110 and one database computing resource 110, the user may tag each of the computing resources 110 with the string ‘accounting’ to associate the computing resources 110 with the provisioned ad hoc application.

In some embodiments, multiple users using one or more user systems 102 communicate tag information 138 that includes the same string. Component tagging module 142 may disambiguate usage by placing metadata tags into a namespace associated with the user that applied the tag. For example, component tagging module 142 may record the string ‘accounting’ for a first user as ‘user1:accounting’ and the string ‘accounting’ for a second user as ‘user2:accounting’. Although component tagging module 142 may record each string using a namespace, when displaying tag information 138, component tagging module 142 may hide the namespace from the user. An example of tag information 138 applied to ad hoc application components is shown in FIG. 4.
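
The namespacing behavior might be sketched as follows (helper names are hypothetical):

    def namespaced_tag(user_id, tag):
        """Record a tag in the namespace of the user who applied it."""
        return f"{user_id}:{tag}"

    def display_tag(stored_tag):
        """Hide the namespace when displaying tag information 138."""
        return stored_tag.split(":", 1)[1]

    stored = namespaced_tag("user1", "accounting")  # 'user1:accounting'
    assert display_tag(stored) == "accounting"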

Additionally or alternatively, a user at user system 102 interacts with application definition module 144 to create an application definition (e.g., application definition 140) of a provisioned ad hoc application, as represented by arrow 202. Application definition 140 includes at least a primary computing resource 110 for which reliability estimate 134 is to be calculated. In some embodiments, application definition 140 may include one or more secondary computing resources 110 that are supportive of the primary computing resource 110. For example, the primary computing resource 110 may be a software service while a secondary computing resource 110 may be a web service accessed by the software service. An example application definition 140 is shown in FIG. 3. As shown in FIG. 3, a user may construct application definition 140 by defining a primary computing resource 110 (e.g., Application) and secondary computing resources 110 (e.g., Service 1 and Service 2) upon which the primary computing resource 110 depends.

In some embodiments, application definition 140 may not define all secondary computing resources 110 used by a particular ad hoc application. Thus, graph inference module 146 may use application definition 140, tag information 138, and infrastructure repository 136 to construct the dependencies and interrelationships among the various computing resources 110 utilized by a particular ad hoc application for which reliability estimate 134 is sought, represented by arrow 203. In some embodiments, application definition module 144 defines the starting seeds for the graph inference module 146, and graph inference module 146 may expand the user-provided ad hoc application definition 140 into a more comprehensive application definition.

An example directed graph is shown in FIG. 5. Graph inference module 146 may recursively expand the ad hoc application definition 140 by following known component associations (as defined by tag information 138 and infrastructure repository 136) to produce a directed graph of component dependencies. For example, as shown in box 501 of FIG. 5, graph inference module 146 may access databases (e.g., tag information 138 and infrastructure repository 136) to obtain one or more secondary computing resources 110 relied upon by a primary computing resource 110 (e.g., Application, as shown in box 501). Graph inference module 146 may operate recursively to identify secondary computing resources 110. For example, a primary computing resource 110 may have two secondary computing resources 110. In this example, graph inference module 146 identifies “service 1” and “service 2” as secondary computing resources 110 (indicated in boxes 502 and 503), upon which “application” (indicated in box 501) is dependent. Graph inference module 146 may then determine the computing resources 110 upon which the secondary computing resources 110 depend. For example, graph inference module 146 may then determine that “service 1” (identified in box 502) depends upon “server 1” (identified in box 505) and “database” (identified in box 504). Graph inference module 146 repeats this process for each secondary computing resource 110 identified until there are no further dependent computing resources 110, as indicated by tag information 138 and infrastructure repository 136. Thus, graph inference module 146 generates a directed graph as shown in FIG. 5.
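
A minimal sketch of this recursive expansion, assuming a hypothetical dependency lookup that stands in for the associations recorded in tag information 138 and infrastructure repository 136 (the topology follows the FIG. 5 example):

    # Hypothetical lookup: component -> components it directly relies upon,
    # following the directed graph 500 example of FIG. 5.
    dependencies = {
        "Application": ["Service 1", "Service 2"],
        "Service 1": ["Database", "Server 1"],
        "Service 2": ["Server 2"],
        "Database": ["Switch"],
        "Server 1": ["Volume 1", "Switch"],
        "Server 2": ["Volume 2", "Switch"],
        "Volume 1": ["Switch"],
        "Volume 2": ["Switch"],
        "Switch": ["Power 1", "Power 2"],
        "Power 1": ["Room"],
        "Power 2": ["Room"],
        "Room": [],
    }

    def expand(primary):
        """Recursively follow dependency relationships to collect directed edges."""
        edges, seen = [], set()

        def visit(component):
            if component in seen:
                return
            seen.add(component)
            for relied_upon in dependencies.get(component, []):
                edges.append((component, relied_upon))
                visit(relied_upon)

        visit(primary)
        return edges

    graph_edges = expand("Application")  # edges of a directed graph like FIG. 5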

Returning to FIG. 2, in some embodiments, a relied upon computing resource 110 is a computing resource 110 for which there exists a set of component computing resources 110 (possibly the empty set) such that the primary computing resource 110 is operable when only the set of component computing resources 110 is inoperable, and the primary computing resource 110 is inoperable when both the relied upon computing resource 110 and the set of component computing resources 110 are inoperable. For example, each disk drive in a pair of redundant drives is a relied upon component computing resource 110 even though the failure of any single drive may not cause the primary computing resource 110 to become inoperable.

Graph inference module 146 may request infrastructure components that are allocated to the application components from infrastructure repository 136. Infrastructure components are part of the infrastructure provider's implementation of a resource and are generally kept secret; examples include the arrangement of physical racks, network switches, power supplies, air conditioners, fire suppression units, telecommunication links, and buildings.

Graph inference module 146 may request application components associated with tags from component tagging module 142. For example, graph inference module 146 may locate application components tagged with an identifier associated with a resource. FIG. 4 illustrates the example application definition 140 expanded to include several application components found using tag associations. FIG. 4 includes two computing resources 110 (e.g., a first server and a database) tagged with “Service 1” as a tag for the first service and a third computing resource 110 (e.g., a second server) tagged with “Service 2” as a tag for the second service.

Graph inference module 146 may display the identified application components to a user at user system 102 for validation. Although the inference of application components may be beneficial to the user by reducing time spent defining the application or tagging, an incorrect inference may unnecessarily expand the directed graph. In some embodiments, graph inference module 146 may support a mechanism for excluding specific application components shown in a particular directed graph, for example, by having the user apply a tag excluding the undesired component (e.g., a “does not require” tag) to override the standard inference algorithm used by graph inference module 146.

Once directed graph 500 is generated, application probability calculator 148 analyzes dependency relationships among application components in directed graph 500 to construct conditional probability tables 600. For example, in directed graph 500 shown in FIG. 5, application probability calculator 148 calculates conditional probability tables 600 for each of Application (box 501), Service 1 (box 502), Service 2 (box 503), Database (box 504), Server 1 (box 505), Server 2 (box 506), Volume 1 (box 507), and Volume 2 (box 508). Data indicating the reliability of application components identified by graph inference module 146, drawn from user reliability data 124 and instance reliability data 126, may feed into application probability calculator 148, as shown by arrows 204. Application probability calculator 148 may calculate the expected availability of a respective application component in directed graph 500 based on the availability of relied upon components. For example, application probability calculator 148 calculates the availability of “Service 1” (as indicated in box 502) based on the availability of “Database” (as indicated in box 504) and “Server 1” (as indicated in box 505). For each application component for which a conditional probability table 600 is calculated, application probability calculator 148 may access user reliability data 124 and instance reliability data 126.
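
A sketch of how such a conditional probability table 600 might be constructed for a non-redundant component such as Server 1, which requires every relied-upon component to be available (the 99.8% base availability follows the FIG. 6 example; redundant dependencies are handled differently, as discussed with respect to conditional probability table 600c below):

    from itertools import product

    def build_cpt(parents, base_availability):
        """Enumerate availability permutations of relied-upon components.

        base_availability is the historically observed availability of the
        component when every relied-upon component is available.
        """
        table = {}
        for states in product((0, 1), repeat=len(parents)):
            # Unavailable (0%) if any relied-upon component is down.
            table[states] = base_availability if all(states) else 0.0
        return table

    # CPT for Server 1, which relies upon Volume 1 and Switch (see FIG. 6).
    cpt_server1 = build_cpt(("Volume 1", "Switch"), 0.998)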

Additionally, once directed graph 500 is generated, infrastructure probability calculator 150 analyzes dependency relationships among infrastructure components in directed graph 500 to construct conditional probability tables 600. For example, in directed graph 500 shown in FIG. 5, infrastructure probability calculator 150 calculates conditional probability tables 600 for each of Switch (box 509), Power 1 (box 510), Power 2 (box 511), and Room (box 512). Data indicating the reliability of infrastructure components identified by graph inference module 146, drawn from instance reliability data 126 and class reliability data 128, may feed into infrastructure probability calculator 150, as shown by arrows 205. Infrastructure probability calculator 150 examines the directly connected relied upon components in directed graph 500 to construct a conditional probability table 600 for the availability of the infrastructure component based on the availability of the directly connected components. For example, infrastructure probability calculator 150 may locate all of the relied upon components in directed graph 500 that directly point to a relevant component. Infrastructure probability calculator 150 may then construct a conditional probability table 600 by determining the historical availability of directly connected components.

In some embodiments, infrastructure probability calculator 150 may introduce a noise term into conditional probability table 600 to obscure the exact configuration of infrastructure components. The use of noisy probability may improve accuracy by permitting the infrastructure provider to include infrastructure details in the model that might otherwise be revealed through inspection of reliability estimates.
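
The disclosure does not specify a particular noise model; one possible sketch is a small, bounded random perturbation of each table entry:

    import random

    def add_noise(probability, scale=0.0005):
        """Perturb a CPT entry slightly to obscure exact infrastructure details."""
        noisy = probability + random.uniform(-scale, scale)
        return min(max(noisy, 0.0), 1.0)  # keep the result a valid probability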

In some embodiments, infrastructure probability calculator 150 may factor either instance reliability data 126 or class reliability data 128 more heavily in its calculation. For example, direct observation of infrastructure component availability (e.g., instance reliability data 126) may be preferred for component availability. If no direct observation exists, the component availability may be estimated based on fleet statistics for the component (e.g., class reliability data 128).

Once conditional probability tables 600 are calculated for each component in directed graph 500, reliability estimator module 152 evaluates directed graph 500 and conditional probability tables 600 as a Bayesian network to produce reliability estimate 134. Exact computation of the reliability of the primary resource (such as, e.g., Application in directed graph 500) may be possible for simple directed graphs, such as graphs with only a single path to any component. However, in many cases, the directed graph may not have a direct solution. In some embodiments, reliability estimator module 152 may support stochastic simulation of the inferred directed graph 500 to compute the reliability of the primary resource. For example, reliability estimator module 152 may run a number of trials sampling different availability configurations according to the conditional probabilities (as shown, e.g., in conditional probability tables 600) for each component in directed graph 500. The availability of the primary resource may then be estimated by counting the number of failures of the primary resource according to the inferred directed graph 500 over a large number of trials. Numerous trials may be run in order to obtain reliability estimate 134 of the primary resource.

A table displaying the results of an example series of trials is shown in table 700 of FIG. 7. For example, a first trial (“Trial 1”) begins with reliability estimator 152 assigning availability to Room (box 512 of FIG. 5) in accordance with the conditional probability table of Room. For purposes of this example, a value of “1” represents available, and a value of “0” represents unavailable. In this example, the conditional probability of Room being 1 is 99.999%, and reliability estimator 152 assigns Room as 1 in 99.999% of trials and 0 in 0.001% of trials. In the example Trial 1, Room is assigned a 1 (but in 0.001% of trials Room will be assigned a 0). Next, reliability estimator 152 assigns availability to Power 1 (box 510 of FIG. 5) in accordance with the conditional probability table 600 of Power 1. In this example, the conditional probability of Power 1 being 1 is 99.97% when Room (upon which Power 1 relies) is 1. Therefore, reliability estimator 152 assigns Power 1 as 1 in 99.97% of the trials in which it assigned Room as 1, and assigns Power 1 as 0 in 0.03% of the trials in which it assigned Room as 1. In the example Trial 1, reliability estimator 152 assigns Power 1 as 1. Reliability estimator 152 performs analogous calculations for Power 2 (box 511 in FIG. 5), and in the example Trial 1, Power 2 is assigned a 1. Next, reliability estimator 152 assigns Switch (box 509 in FIG. 5) in accordance with its conditional probability table in which Power 1 and Power 2 (upon which Switch relies) are both 1, and in example Trial 1, Switch is assigned a 1. Next, reliability estimator 152 assigns Volume 1 (box 507 in FIG. 5) in accordance with its conditional probability table in which Switch is 1, and in example Trial 1, Volume 1 is assigned a 1. Similar calculations are performed for each component in directed graph 500, resulting in an availability calculation for Application. In example Trial 1, Application is assigned a 1.

Next, reliability estimator 152 performs a second trial (“Trial 2”), the results of which are shown in table 700 in FIG. 7. In this example, Room is assigned 1, Power 1 is assigned 1, Power 2 is assigned 0, Switch is assigned 1, Volume 1 is assigned 1, and Application is assigned 1, in accordance with the statistical outcomes indicated by their respective conditional probability tables.

Successive trials are performed (e.g., Trial 3 through Trial 1,000,000 shown in table 700), and the number of times Application is assigned a 1 is compared to the number of times Application is assigned a 0 in the aggregate number of trials. For example, reliability estimator module 152 may determine that in 99.89% of trials, Application is assigned a 1. Thus, reliability estimate 134 is calculated to be 99.89%. Once calculated, reliability estimate 134 may be stored in storage module 108 and/or transmitted to user system 102 to be displayed to a user.
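
The trial procedure described above might be sketched as follows. The dependency topology follows the FIG. 5 example; the base availability values for Room, Power 1, Power 2, Switch, Database, and Server 1 follow the FIGS. 6-7 discussion, while the remaining values are hypothetical, and Switch is assumed to be the only redundantly supplied component:

    import random

    # Relied-upon components per directed graph 500 (FIG. 5), listed so that
    # each component appears after every component it relies upon.
    parents = {
        "Room": [], "Power 1": ["Room"], "Power 2": ["Room"],
        "Switch": ["Power 1", "Power 2"],
        "Volume 1": ["Switch"], "Volume 2": ["Switch"], "Database": ["Switch"],
        "Server 1": ["Volume 1", "Switch"], "Server 2": ["Volume 2", "Switch"],
        "Service 1": ["Database", "Server 1"], "Service 2": ["Server 2"],
        "Application": ["Service 1", "Service 2"],
    }
    # Availability when relied-upon components are up.
    base = {
        "Room": 0.99999, "Power 1": 0.9997, "Power 2": 0.9997, "Switch": 0.9999,
        "Volume 1": 0.999, "Volume 2": 0.999, "Database": 0.9976,
        "Server 1": 0.998, "Server 2": 0.998,
        "Service 1": 0.9995, "Service 2": 0.9995, "Application": 0.9999,
    }
    REDUNDANT = {"Switch"}  # Switch needs only one of Power 1 / Power 2

    def sample_trial():
        """Assign 1 (available) or 0 (unavailable) to each component in turn."""
        state = {}
        for component, relied_upon in parents.items():
            ups = [state[r] for r in relied_upon]
            supported = any(ups) if component in REDUNDANT else all(ups)
            p = base[component] if supported else 0.0
            state[component] = 1 if random.random() < p else 0
        return state

    trials = 100_000
    available = sum(sample_trial()["Application"] for _ in range(trials))
    print(f"reliability estimate 134: {available / trials:.4%}")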

In some embodiments, a series of trials may represent sampling from among all possible combinations of the availability status of each component in a directed graph. For example, reliability estimator module 152 may perform availability sampling to determine reliability estimate 134 for a particular primary computing resource 110 (such as, e.g., Application shown in FIG. 5). Availability sampling may be based on one or more samples of an availability configuration of a directed graph (such as, e.g., directed graph 500 shown in FIG. 5). An availability configuration is a permutation of the availability status (where “1” represents available and “0” represents unavailable) assigned to each computing resource 110 in a directed graph (such as, e.g., directed graph 500). For example, for each availability configuration, each computing resource 110 in the directed graph is either available (i.e., “1”) or unavailable (i.e., “0”). For each sample availability configuration, there is a probability that the particular availability configuration will be observed in practice. Each availability configuration has a probability between and including 0% and 100%. Some availability configurations have a 0% chance of being observed. For example, it is not possible that a server computing resource 110 is available when relied upon power supply computing resources 110 are unavailable. Thus, the probability for an availability configuration in which the server computing resource 110 is available (1) and the relied upon power supply computing resources 110 are unavailable (0) is 0%. The sum of the probabilities across every possible availability configuration is 100%.

Reliability estimator module 152 may calculate the probability of a particular availability configuration based on conditional probability tables 600. As discussed further below, conditional probability tables 600 give a probability for each component to exist in a particular availability configuration, given the availability status of relied upon components. Since an availability configuration gives an availability status for each component in a directed graph simultaneously, the probability of the availability configuration occurring in practice is then the product of each of the component probabilities as indicated in the conditional probability table 600 associated with each component. The set of all availability configurations can be enumerated in a table in which each row is a particular availability configuration and each column is a component in a directed graph (such as, e.g., directed graph 500).
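
As a small worked sketch, the probability of one availability configuration for a hypothetical three-component chain (Room supplying Power 1 supplying Switch, all available) is the product of the corresponding CPT entries:

    # CPT entries conditional on relied-upon components being available.
    p_room = 0.99999                # P(Room = 1)
    p_power1_given_room = 0.9997    # P(Power 1 = 1 | Room = 1)
    p_switch_given_power1 = 0.9999  # P(Switch = 1 | Power 1 = 1)

    configuration_probability = p_room * p_power1_given_room * p_switch_given_power1
    print(f"{configuration_probability:.6f}")  # ~0.999590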

Since each component in a directed graph is assigned a 0 or 1, the total number of configurations (rows in the table) is 2 to the power of the number of components present in the directed graph (i.e., 2^N). Even a small number of components makes examining every row (i.e., the probability associated with each availability configuration) infeasible. For example, a directed graph with 50 components would have a table with over one quadrillion rows. Therefore, in some embodiments, selected availability configurations are sampled in order to calculate reliability estimate 134. Sampling may be performed according to one or more methods. For example, in some embodiments, reliability estimator module 152 may divide the availability configurations into groups of relatively equal probability and may select particular samples from each group. The sampling performed may be an orthogonal sampling method, such as orthogonal Latin hypercube sampling.

In some embodiments, reliability estimator module 152 performs sampling by working backwards from the availability of the primary resource (such as, e.g., Application in directed graph 500 shown in FIG. 5). Assuming a priori that a primary resource is either available or not available, based on conditional probability tables 600, there is a probability for the resources that the primary resource relies upon to be available or unavailable in a configuration, given the assumed state for the primary resource. Reliability estimator module 152 may then sample from among these configurations by any appropriate method, such as, for example, greedy algorithmic sampling and/or orthogonal sampling.

After sampling is performed, the sample probabilities are summed according to whether the primary resource is available or unavailable, producing two probabilities: an available probability (“A”) and an unavailable probability (“U”). The sum of A and U is greater than or equal to 0 but less than or equal to 1. In some embodiments, reliability estimator module 152 scales the available probability so that the two outcomes sum to 1 to calculate reliability estimate 134 (i.e., by calculating A/(A+U)).
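
For example (hypothetical values), if the sampled configurations sum to A = 0.90 and U = 0.001:

    A, U = 0.90, 0.001      # summed sample probabilities (hypothetical)
    estimate = A / (A + U)  # scale so the two outcomes sum to 1
    print(f"{estimate:.4%}")  # ~99.8890%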

Reliability estimate 134 is most accurate when A+U is close to 1 and becomes increasingly inaccurate as A+U approaches 0, since scaling the measurements is an approximation for the configurations that are not sampled. Reliability estimator module 152 may go back and perform additional sampling if A+U is too small to improve the accuracy of reliability estimate 134.

FIG. 3 illustrates an example application definition 140 including a primary resource (e.g., an ad hoc application) and two secondary resources (e.g., a first service and a second service). In some embodiments, the first and second services are resources the primary resource depends upon for operation. For example, if the primary resource is a software application hosted on a website, the first service may represent a web server, and the second service may represent a database. A user at user system 102 may tag Service 1 with a “Service 1” tag, Service 2 with a “Service 2” tag, and tag Application with “Service 1” and “Service 2” tags.

FIG. 4 illustrates an example application definition 140 from FIG. 3 expanded to include several application components found using tag associations (such as, e.g., based on tag information 138 received from user system 102) and infrastructure repository 136. FIG. 4 includes a first server and a database tagged with Service 1 as an identifier for the first service and a second server tagged with Service 2 as an identifier for the second service.

FIG. 5 illustrates an example directed graph 500 constructed by graph inference module 146 in which the example application definition 140 from FIG. 4 is expanded to include several application and infrastructure components found using allocation and dependency relationships (e.g., based on tag information 138 and application definition 140). FIG. 5 includes the application components service 1 (box 502), service 2 (box 503), database (box 504), server 1 (box 505), server 2 (box 506), a first drive volume (box 507), and a second drive volume (box 508), as well as the infrastructure components: a network switch (box 509), a first power supply (box 510), a second power supply (box 511), and a room (box 512).

FIG. 6 illustrates conditional probability tables 600a-c (which may be referred to individually as “conditional probability table 600” or collectively as “conditional probability tables 600”) for application and infrastructure components included in directed graph 500 shown in FIG. 5. Although FIG. 6 shows example conditional probability tables 600 based on components illustrated in FIG. 5, it should be understood that any suitable conditional probability table 600 may be generated based on the particular configuration of system 100. In particular embodiments, a conditional probability table 600 includes the permutations of the available and not available status of relied upon components for each component in a directed graph (e.g., directed graph 500). The status is represented as a binary condition, in which 1 represents available and 0 represents unavailable. For example, if a primary component relies upon first and second secondary components, a conditional probability table 600 includes a first row in which the first secondary component is 0 and the second secondary component is 0, a second row in which the first secondary component is 0 and the second secondary component is 1, a third row in which the first secondary component is 1 and the second secondary component is 0, and a fourth row in which the first secondary component is 1 and the second secondary component is 1. Thus, a conditional probability table 600 includes a row for each permutation of the availability of directly relied upon components for each component in a directed graph (e.g., directed graph 500).

Conditional probability tables 600 for application components may be calculated by application probability calculator 148, and conditional probability tables 600 for infrastructure components may be calculated by infrastructure probability calculator 150. For example, conditional probability table 600a illustrates conditional probabilities for Server 1 (box 505 in FIG. 5). Server 1 relies upon Volume 1 (box 507 in FIG. 5) and Switch (box 509 in FIG. 5). The available/non-available conditions for Volume 1 and Switch are shown in the first and second columns of conditional probability table 600a, respectively. The available/non-available condition for Server 1, which is dependent on the Volume 1 and Switch columns, is shown in the third column. The availability of Server 1 (expressed as a percentage) is determined based on the availability of Volume 1 and Switch, represented as a binary condition, with 1 representing available and 0 representing not available. For example, with reference to the first row of conditional probability table 600a, Volume 1 is 0 and Switch is 0, and Server 1 is therefore 0%, because Server 1 is not operational if Volume 1 and Switch are not available. With reference to the second row of conditional probability table 600a, Volume 1 is 1 and Switch is 0, and Server 1 is therefore 0%, because Server 1 is not operational if Switch is not available. With reference to the third row of conditional probability table 600a, Volume 1 is 0 and Switch is 1, and Server 1 is therefore 0%, because Server 1 is not operational if Volume 1 is not available. With reference to the fourth row of conditional probability table 600a, Volume 1 is 1 and Switch is 1, and Server 1 is therefore 99.8%. If Volume 1 and Switch are available, then the availability of Server 1 is based on historical reliability metrics data (such as, e.g., user reliability data 124, instance reliability data 126, and/or class reliability data 128), as discussed above.

Conditional probability table 600b illustrates conditional probabilities for the Database component illustrated in FIG. 5 (box 504). The Database component relies upon the Switch component (box 509 in FIG. 5). The available/not available condition for the Switch component is shown in the first column, and the available/not available condition for the Database component, which is dependent on the available/non-available condition in the Switch column, is shown in the second column. With reference to the first row of conditional probability table 600b, Switch is 0, and Database is therefore 0%. Because the Database component is dependent upon the Switch component, the Database component is not available when the Switch component is not available. With reference to the second row of conditional probability table 600b, Switch is 1, and Database is 99.76%. Because the Switch component is available, the availability of the Database component is determined from historical reliability metrics data (such as, e.g., user reliability data 124, instance reliability data 126, and/or class reliability data 128), as discussed above.

Conditional probability table 600c illustrates conditional probabilities for the Switch component illustrated in FIG. 5 (box 509). The Switch component relies upon the Power 1 component (box 510) or the Power 2 component (box 511). That is, the Power 1 and Power 2 components are redundant dependencies of the Switch component. With reference to the first row of conditional probability table 600c, Power 1 is 0 and Power 2 is 0, and Switch is therefore 0%. Because the Switch component is dependent upon the Power 1 or Power 2 components, the Switch component is not available when both Power 1 and Power 2 are not available. With reference to the second row of conditional probability table 600c, Power 1 is 0 and Power 2 is 1, and Switch is 99.99%. If either Power 1 or Power 2 is available, the availability of the Switch component is determined from historical reliability metrics data (such as, e.g., user reliability data 124, instance reliability data 126, and/or class reliability data 128), as discussed above. Similarly, with reference to the third row of conditional probability table 600c, Power 1 is 1 and Power 2 is 0, and Switch is 99.99%. With reference to the fourth row of conditional probability table 600c, Power 1 is 1 and Power 2 is 1, and Switch is 99.99%. Since Power 1 and Power 2 are both available (although only one of Power 1 or Power 2 need be available for this condition to result), the availability is determined from historical reliability metrics data.
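
The redundant (OR) relationship of conditional probability table 600c contrasts with the non-redundant (AND) relationship of table 600a; a sketch using the values described above:

    from itertools import product

    def switch_cpt(base_availability=0.9999):
        """CPT 600c sketch: Switch is available with probability base_availability
        if at least one of Power 1 / Power 2 is available, and is otherwise 0%."""
        return {states: (base_availability if any(states) else 0.0)
                for states in product((0, 1), repeat=2)}

    for (power1, power2), prob in sorted(switch_cpt().items()):
        print(f"Power 1={power1}  Power 2={power2}  Switch availability={prob:.2%}")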

FIG. 7 illustrates a table 700 that includes the results of an example series of trials performed by reliability estimator 152 to calculate reliability estimate 134. As discussed above, successive trials are performed (e.g., Trial 1 through Trial 1,000,000 shown in table 700), and the number of times Application is assigned a 1 is compared to the number of times Application is assigned a 0 in the aggregate number of trials. For example, reliability estimator module 152 may determine that in a series of 1,000,000 trials, Application is available in 998,900 trials and unavailable in 1,100 trials. Thus, reliability estimator module 152 calculates reliability estimate 134 to be 99.89%.

FIG. 8 illustrates an example computer system 800 that may be used for one or more portions of the example system 100 of FIG. 1, according to certain embodiments of the present disclosure. Although the present disclosure describes and illustrates a particular computer system 800 having particular components in a particular configuration, the present disclosure contemplates any suitable computer system having any suitable components in any suitable configuration. Moreover, computer system 800 may take any suitable physical form, such as, for example, one or more integrated circuits (ICs), one or more printed circuit boards (PCBs), one or more handheld or other devices (such as mobile telephones or PDAs), one or more personal computers, one or more supercomputers, one or more servers, and one or more distributed computing elements. Portions or all of user system 102, server system 106, storage module 108, and computing resources 110 may be implemented using all of the components, or any appropriate combination of the components, of computer system 800 described below.

Computer system 800 may have one or more input devices 802 (which may include a keypad, keyboard, mouse, stylus, or other input devices), one or more output devices 804 (which may include one or more displays, one or more speakers, one or more printers, or other output devices), one or more storage devices 806, and one or more storage media 808. An input device 802 may be external or internal to computer system 800. An output device 804 may be external or internal to computer system 800. A storage device 806 may be external or internal to computer system 800. A storage medium 808 may be external or internal to computer system 800.

System bus 810 couples subsystems of computer system 800 to each other. Herein, reference to a bus encompasses one or more digital signal lines serving a common function. The present disclosure contemplates any suitable system bus 810 including any suitable bus structures (such as one or more memory buses, one or more peripheral buses, one or more local buses, or a combination of the foregoing) having any suitable bus architectures. Example bus architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Enhanced ISA (EISA) bus, Micro Channel Architecture (MCA) bus, Video Electronics Standards Association local (VLB) bus, Peripheral Component Interconnect (PCI) bus, PCI Express (PCIe) bus, and Accelerated Graphics Port (AGP) bus.

Computer system 800 includes one or more processors 812 (or central processing units (CPUs)). A processor 812 may contain a cache 814 for temporary local storage of instructions, data, or computer addresses. Processors 812 are coupled to one or more storage devices, including memory 816. Memory 816 may include RAM 818 and ROM 820. Data and instructions may transfer bi-directionally between processors 812 and RAM 818. Data and instructions may transfer uni-directionally to processors 812 from ROM 820. RAM 818 and ROM 820 may include any suitable computer-readable storage media.

Computer system 800 includes fixed storage 822 coupled bi-directionally to processors 812. Fixed storage 822 may be coupled to processors 812 via storage control unit 807. Fixed storage 822 may provide additional data storage capacity and may include any suitable computer-readable storage media. Fixed storage 822 may store an operating system (OS) 824, one or more executables (EXECs) 826, one or more applications or programs 828, data 830, and the like. Fixed storage 822 is typically a secondary storage medium (such as a hard disk) that is slower than primary storage. In appropriate cases, the information stored by fixed storage 822 may be incorporated as virtual memory into memory 816. In certain embodiments, fixed storage 822 may include network resources, such as one or more storage area networks (SANs) or network-attached storage (NAS).

Processors 812 may be coupled to a variety of interfaces, such as, for example, graphics control 832, video interface 834, input interface 836, output interface 837, and storage interface 838, which in turn may be respectively coupled to appropriate devices. Example input or output devices include, but are not limited to, video displays, trackballs, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styli, voice or handwriting recognizers, biometrics readers, or computer systems. Network interface 840 may couple processors 812 to another computer system or to network 842. Network interface 840 may include wired, wireless, or any combination of wired and wireless components. Such components may include wired network cards, wireless network cards, radios, antennas, cables, or any other appropriate components. With network interface 840, processors 812 may receive or send information from or to network 842 in the course of performing steps of certain embodiments. Certain embodiments may execute solely on processors 812. Certain embodiments may execute on processors 812 and on one or more remote processors operating together.

In a network environment, where computer system 800 is connected to network 842, computer system 800 may communicate with other devices connected to network 842. Computer system 800 may communicate with network 842 via network interface 840. For example, computer system 800 may receive information (such as a request or a response from another device) from network 842 in the form of one or more incoming packets at network interface 840, and memory 816 may store the incoming packets for subsequent processing. Computer system 800 may send information (such as a request or a response to another device) to network 842 in the form of one or more outgoing packets from network interface 840, which memory 816 may store before they are sent. Processors 812 may access an incoming or outgoing packet in memory 816 to process it, according to particular needs.

Certain embodiments involve one or more computer-storage products that include one or more tangible, computer-readable storage media that embody software for performing one or more steps of one or more processes described or illustrated herein. In certain embodiments, one or more portions of the media, the software, or both may be designed and manufactured specifically to perform one or more steps of one or more processes described or illustrated herein. Additionally or alternatively, one or more portions of the media, the software, or both may be generally available without design or manufacture specific to processes described or illustrated herein. Example computer-readable storage media include, but are not limited to, CDs (such as CD-ROMs), FPGAs, floppy disks, optical disks, hard disks, holographic storage devices, ICs (such as ASICs), magnetic tape, caches, PLDs, RAM devices, ROM devices, semiconductor memory devices, and other suitable computer-readable storage media. In certain embodiments, software may be machine code which a compiler may generate or one or more files containing higher-level code which a computer may execute using an interpreter.

As an example and not by way of limitation, memory 816 may include one or more tangible, computer-readable storage media embodying software, and computer system 800 may provide particular functionality described or illustrated herein as a result of processors 812 executing the software. Memory 816 may store and processors 812 may execute the software. Memory 816 may read the software from the computer-readable storage media in fixed storage 822 embodying the software or from one or more other sources via network interface 840. When executing the software, processors 812 may perform one or more steps of one or more processes described or illustrated herein, which may include defining one or more data structures for storage in memory 816 and modifying one or more of the data structures as directed by one or more portions of the software, according to particular needs.

In certain embodiments, the described processing and memory elements (such as processors 812 and memory 816) may be distributed across multiple devices such that the operations performed utilizing these elements may also be distributed across multiple devices. For example, software utilizing these elements may be run across multiple computers that contain these processing and memory elements. Other variations involving the use of distributed computing, aside from the stated example, are contemplated.

In addition or as an alternative, computer system 800 may provide particular functionality described or illustrated herein as a result of logic hardwired or otherwise embodied in a circuit, which may operate in place of or together with software to perform one or more steps of one or more processes described or illustrated herein. The present disclosure encompasses any suitable combination of hardware and software, according to particular needs.

Although the present disclosure describes or illustrates particular operations as occurring in a particular order, the present disclosure contemplates any suitable operations occurring in any suitable order. Moreover, the present disclosure contemplates any suitable operations being repeated one or more times in any suitable order. Although the present disclosure describes or illustrates particular operations as occurring in sequence, the present disclosure contemplates any suitable operations occurring at substantially the same time, where appropriate. Any suitable operation or sequence of operations described or illustrated herein may be interrupted, suspended, or otherwise controlled by another process, such as an operating system or kernel, where appropriate. The acts can operate in an operating system environment or as stand-alone routines occupying all or a substantial part of the system processing.

Moreover, data transfer techniques consistent with the present disclosure may be used to communicate any suitable type of data over any suitable type of network. For example, although the present disclosure has been described primarily with reference to reliability metrics data, the present disclosure contemplates processing any suitable type of data for communication over a communication network (e.g., network 104).

What is claimed is:
 1. A system comprising: one or more memory units with executable instructions; and one or more processing units that, when executing the instructions in the one or more memory units, are operable to: receive an application definition associated with an ad hoc application provisioned from one or more computing resources delivered over a network, the application definition identifying a first group of components, the first group of components comprising the ad hoc application and one or more computing resources relied on by the ad hoc application; receive tag information from a user, the tag information indicating one or more aspects of the first group of components; access infrastructure data from an infrastructure repository, the infrastructure data identifying a second group of components, the second group of components comprising one or more computing resources of a distributed architecture that are associated with at least a subset of one or more components in the first group of components; generate a plurality of conditional probability tables, one conditional probability table for at least a first subset of the components in the first group of components and at least a second subset of the second group of components, the plurality of conditional probability tables identifying at least an availability of a respective component of at least the first subset of the first group of components or at least the second subset of the second group of components based at least in part on a second availability of one or more relied upon components of the first group of components or the second group of components, where the one or more relied upon components are components utilized, at least in part, during operation of the respective component; and based at least in part on the plurality of conditional probability tables, calculate a reliability estimate for the ad hoc application by at least performing a plurality of trials, wherein performing the plurality of trials comprises assigning a status of either available or not available to at least a portion of the components in a directed graph, the status based at least in part on a particular conditional probability table associated with a particular component and the status of one or more directly relied upon components.
 2. The system of claim 1, wherein the one or more processing units are further operable to: based at least in part on the application definition, the tag information, and the infrastructure data, generate the directed graph, the directed graph comprising the components from the first group of components and second group of components and indicating one or more dependency relationships among the components; and wherein at least a portion of the plurality of conditional probability tables are associated with at least one of the one or more components in the directed graph, and the one or more processing units are further operable to calculate the reliability estimate based at least in part on the plurality of conditional probability tables and the directed graph.
 3. The system of claim 1, wherein the processing units are further operable to access reliability metrics data for at least the first subset of the first group of components and at least the second subset of the second group of components, wherein the reliability metrics data comprise at least one of: user reliability data, the user reliability data comprising historical availability data of the ad hoc application determined by one or more users of the ad hoc application; instance reliability data, the instance reliability data comprising historical availability of a particular component associated with the ad hoc application; class reliability data, the class reliability data comprising historical availability data associated with a plurality of types of components associated with the ad hoc application; and wherein the processing units are operable to generate the conditional probability table for at least the first subset of the first group of components and at least the second subset of the second group of components based at least in part on the reliability metrics data.
 4. The system of claim 1, wherein the one or more processing units are operable to generate the conditional probability table for at least the first subset of the first group of components and at least the second subset of the second group of components by: for at least a third subset of components in the first group of components and the second group of components, determining the one or more relied upon components; generating one or more rows in the conditional probability table, the one or more rows comprising a subset of permutations, the subset of permutations indicating availability of at least a portion of the one or more relied upon components by the third subset of components; and for the one or more rows, determining the availability of the respective component based at least in part on the subset of permutations.
 5. The system of claim 1, wherein the one or more processing units are operable to calculate the reliability estimate for the ad hoc application by: after performing the plurality of trials, calculating a first number of times the ad hoc application is assigned a status of available; after performing the plurality of trials, calculating a second number of times the ad hoc application is assigned a status of not available; and comparing the first number of times the ad hoc application is assigned a status of available to the second number of times the ad hoc application is assigned a status of not available.
 6. The system of claim 1, further comprising, for at least a portion of the plurality of conditional probability tables, combining the availability of the respective component identified in the conditional probability table with a generated number.
 7. The system of claim 2, wherein the one or more processing units are operable to calculate the reliability estimate for the ad hoc application by: sampling a plurality of availability configurations from a set of all availability configurations, the sampled availability configurations based at least in part on the directed graph and indicating a status of available or not available to at least a portion of the components in the directed graph; and for at least a subset of the sampled availability configurations, determining the probability of the availability configuration based at least in part on a particular conditional probability table associated with at least a subset of the components in the directed graph.
 8. The system of claim 7, wherein the one or more processing units are operable to sample the plurality of availability configurations based at least in part on a hypercube sampling algorithm.
 9. The system of claim 7, wherein the one or more processing units are further operable to calculate the reliability estimate by summing the probabilities of the sampled availability configurations.
 10. The system of claim 1, wherein the one or more processing units are further operable to calculate the reliability estimate for the ad hoc application based at least in part on a result of the plurality of trials.
 11. A computer-implemented method, comprising: identifying one or more components associated with an ad hoc application and upon which the ad hoc application relies, wherein identifying one or more components comprises: obtaining an application definition associated with the ad hoc application, the application definition received from a user and comprising a first group of components, the first group of components including the ad hoc application and one or more components relied upon by the ad hoc application; obtaining tag information, the tag information indicating one or more aspects of the first group of components; and obtaining, based at least in part on the tag information and the application data, infrastructure data from an infrastructure repository, the infrastructure data identifying a second group of components, the second group of components comprising one or more computing resources of a distributed architecture associated with the ad hoc application; generating a directed graph, the directed graph comprising at least a subset of components of the first group of components and the second group of components and indicating one or more dependency relationships among the subset of components; generating a plurality of conditional probability tables, based at least in part on the subset of components in the directed graph, wherein the plurality of conditional probability tables are based at least in part on one or more of the dependency relationships identified in the directed graph and indicate availability of a respective component based at least in part on availability of at least one relied upon component; and calculating, based at least in part on the directed graph, the reliability estimate for the ad hoc application, wherein calculating the reliability estimate for the ad hoc application comprises performing a plurality of trials, wherein performing the plurality of trials comprises assigning a status of either available or not available to at least a portion of the components in the directed graph, the assigned status based at least in part on a particular conditional probability table associated with a particular component of the portion of the components in the directed graph and the assigned status of one or more directly relied upon components.
 12. The method of claim 11, wherein generating the directed graph comprises generating the directed graph based at least in part on the application definition, the tag information, the infrastructure data, and the conditional probability tables.
 13. The method of claim 12, further comprising: obtaining reliability metrics data associated with one or more components in the directed graph, the reliability metrics data comprising at least one of: user reliability data, the user reliability data comprising historical availability data of the ad hoc application determined by one or more users of the ad hoc application; instance reliability data, the instance reliability data comprising historical availability of components associated with the ad hoc application; and class reliability data, the class reliability data comprising historical availability data associated with one or more types of components associated with the ad hoc application.
 14. The method of claim 13, wherein assigning the status to at least a portion of the components comprises: determining whether a particular component of a portion of components directly relied upon by a respective component is assigned available or not available status; and if the assigned status of the particular component is available, assigning the status to the respective component based at least in part on the reliability metrics data.
 15. The method of claim 11, wherein calculating a reliability estimate comprises: after performing the plurality of trials, calculating a first number of times the ad hoc application is assigned the status of available; after performing the plurality of trials, calculating a second number of times the ad hoc application is assigned the status of not available; and comparing the first number of times the ad hoc application is assigned the status of available to the second number of times the ad hoc application is assigned the status of not available.
 16. The method of claim 12, wherein calculating the reliability estimate for the ad hoc application comprises: sampling a plurality of availability configurations from a set of availability configurations, at least a subset of the availability configurations based at least in part on the directed graph and indicating the status of available or not available for at least the portion of the components in the directed graph; and for at least a subset of the sampled availability configurations, determining the probability of a particular availability configuration based at least in part on the conditional probability table associated with a portion of the components in the directed graph.
 17. The method of claim 16, wherein sampling the plurality of availability configurations comprises sampling the plurality of availability configurations based at least in part on a hypercube sampling algorithm.
 18. The method of claim 16, wherein calculating the reliability estimate further comprises summing the probabilities of the sampled availability configurations.
 19. A non-transitory computer-readable medium comprising logic, the logic when executed by one or more processing units operable to perform operations comprising: receiving, from a user, a request for a reliability estimate associated with an ad hoc application; in response to the request, identifying one or more components associated with the ad hoc application and upon which the ad hoc application relies; accessing infrastructure data from an infrastructure repository, the infrastructure data identifying a second group of components, the second group of components comprising one or more computing resources of a distributed architecture associated with the ad hoc application; generating a directed graph, the directed graph comprising one or more identified components and indicating one or more dependency relationships among the one or more identified components; accessing reliability metrics data, the reliability metrics data comprising at least one of user reliability data, instance reliability data, and class reliability data, the reliability metrics data associated with one or more identified components in the directed graph; generating a plurality of conditional probability tables, at least one conditional probability table for at least a subset of the one or more identified components in the directed graph, wherein at least a portion of the plurality of conditional probability tables are based at least in part on the one or more of the dependency relationships identified in the directed graph and indicate availability of a respective component based at least in part on availability of at least one relied upon component; calculating, based at least in part on the directed graph and the reliability metrics data, the reliability estimate for the ad hoc application, wherein calculating the reliability estimate for the ad hoc application comprises performing a plurality of trials, wherein performing the plurality of trials comprises assigning a status of either available or not available to at least a subset of the one or more identified components in the directed graph, the assigned status being based at least in part on a particular conditional probability table of the plurality of conditional probability tables associated with a particular component and the assigned status of one or more directly relied upon components; and transmitting the reliability estimate to the user.
 20. The non-transitory computer-readable medium of claim 19, wherein identifying the one or more components associated with the ad hoc application comprises: accessing an application definition associated with the ad hoc application, the application definition received from the user and comprising a first group of components, the first group of components including the ad hoc application and one or more components relied upon by the ad hoc application; accessing tag information, the tag information indicating one or more aspects of the first group of components; and accessing, based at least in part on the tag information and application data, infrastructure data from an infrastructure repository, the infrastructure data identifying the second group of components, the second group further comprising one or more components of the ad hoc application.
 21. The non-transitory computer-readable medium of claim 20, wherein generating the directed graph comprises: generating the directed graph based at least in part on the application definition, the tag information, the infrastructure data, and the conditional probability tables.
 22. The non-transitory computer-readable medium of claim 19, wherein: the user reliability data comprises historical availability data of the ad hoc application determined by one or more other users of the ad hoc application; the instance reliability data comprises historical availability of a particular component associated with the ad hoc application; and the class reliability data comprises historical availability data associated with one or more types of components associated with the ad hoc application.
 23. The non-transitory computer-readable medium of claim 19, wherein assigning the status to at least the subset of the one or more identified components comprises: determining whether the subset of the one or more identified components directly relied upon by the respective component is assigned available or not available status; and if the status of the subset of the one or more identified components is available, assigning the status to the respective component based at least in part on the reliability metrics data.
 24. The non-transitory computer-readable medium of claim 19, wherein calculating the reliability estimate comprises: after performing the plurality of trials, calculating a first number of times the ad hoc application is assigned the status of available; after performing the plurality of trials, calculating a second number of times the ad hoc application is assigned the status of not available; and comparing the first number of times the ad hoc application is assigned the status of available to the second number of times the ad hoc application is assigned the status of not available.
 25. The non-transitory computer-readable medium of claim 21, wherein the operations further comprise, for at least a subset of the plurality of conditional probability tables, combining an availability of the one or more identified components in a particular conditional probability table with a generated number.
 26. The non-transitory computer-readable medium of claim 21, wherein the logic is operable to calculate the reliability estimate for the ad hoc application by: sampling a plurality of availability configurations from a set of availability configurations, the set of the availability configurations based at least in part on the directed graph and indicating the status of available or the status of not available for at least a subset of the one or more identified components in the directed graph; and for at least a portion of the sampled availability configurations, determining a probability of the availability configuration based at least in part on a particular conditional probability table associated with at least a subset of the one or more identified components in the directed graph.
 27. The non-transitory computer-readable medium of claim 26, wherein the logic is operable to sample the plurality of availability configurations based at least in part on a hypercube sampling algorithm.
 28. The non-transitory computer-readable medium of claim 26, wherein the logic is further operable to calculate the reliability estimate by summing the probabilities of the sampled availability configurations.
 29. The non-transitory computer-readable medium of claim 19, wherein the logic is further operable to determine the plurality of trials such that the plurality of trials represents a sampling from among a set of possible combinations of the assigned status of the one or more identified components in the directed graph.
 30. A computer-implemented method, comprising: receiving an application definition associated with an ad hoc application identifying a first set of components of a distributed architecture for executing the ad hoc application; obtaining information identifying a second set of components of a distributed architecture associated with the first set of components; generating a plurality of conditional probability tables for the first set of components and the second set of components, the conditional probability table identifying an availability of a first component of the first set of components based at least in part on a second availability of a second component of the second set of components, where the first component relies on the second component during operation of the first component during execution of the ad hoc application; and calculating a reliability estimate for the ad hoc application based at least in part on the conditional probability table by at least performing a plurality of trials, wherein performing the plurality of trials comprises assigning a status of either available or not available to at least a portion of the components in a directed graph, the status based at least in part on a particular conditional probability table associated with a particular component and the status of one or more directly relied upon components.
 31. The computer-implemented method of claim 30, wherein the computer-implemented method further comprises generating the directed graph based at least in part on the application definition, infrastructure data, and the conditional probability tables.
 32. The computer-implemented method of claim 31, wherein the computer-implemented method further comprises: obtaining reliability metrics data associated with one or more components in the directed graph, the reliability metrics data comprising at least one of: user reliability data, the user reliability data comprising historical availability data of the ad hoc application determined by one or more users of the ad hoc application; instance reliability data, the instance reliability data comprising historical availability of components associated with the ad hoc application; and class reliability data, the class reliability data comprising historical availability data associated with one or more types of components associated with the ad hoc application.
 33. The computer-implemented method of claim 31, wherein assigning the status to the portion of the components comprises: determining whether a particular component of a portion of components directly relied upon by a respective component is assigned available or not available status; and if the assigned status of the particular component is available, assigning the status to the respective component based at least in part on the reliability metrics data.
 34. A non-transitory computer-readable storage medium having stored thereon executable instructions that, as a result of being executed by one or more processors of a computer system, cause the computer system to at least: obtain a request for a reliability estimate associated with an ad hoc application, the ad hoc application associated with an application definition indicating a set of components used to execute the ad hoc application; in response to the request: determine a first subset of components of the set of components that have a dependency relationship with at least a component of a second subset of components of the set of components; generate a plurality of conditional probability tables, at least one conditional probability table of the plurality of conditional probability tables for the first subset of components based at least in part on an availability of the second component; determine the reliability estimate for the ad hoc application based at least in part on the plurality of conditional probability tables, wherein determining the reliability estimate for the ad hoc application comprises performing a plurality of trials, wherein performing the plurality of trials comprises assigning a status of either available or not available to at least a subset of the set of components in a directed graph, the assigned status being based at least in part on the at least one conditional probability table of the plurality of conditional probability tables associated with a particular component and the assigned status of one or more directly relied upon components; and transmit the reliability estimate in response to the request.
 35. The non-transitory computer-readable medium of claim 34, wherein the instructions further comprise instructions that, as a result of being executed by the one or more processors, cause the computer system to generate the directed graph based at least in part on the application definition, infrastructure data, and the plurality of conditional probability tables.
 36. The non-transitory computer-readable medium of claim 34, wherein the instructions further comprise instructions that, as a result of being executed by the one or more processors, cause the computer system to: determine whether the component of the second subset of components of the set of components is assigned a status of available or not available; and if the status is available, assign the status to the first subset of components.
 37. The non-transitory computer-readable storage medium of claim 34, wherein the instructions further comprise instructions that, as a result of being executed by the one or more processors, cause the computer system to obtain information indicating availability of the second subset of components from an infrastructure repository.
 38. The non-transitory computer-readable storage medium of claim 34, wherein the instructions further comprise instructions that, as a result of being executed by the one or more processors, cause the computer system to obtain an input from a user through a user interface, the input indicating the dependency relationship.
 39. The non-transitory computer-readable storage medium of claim 34, wherein during at least a portion of the plurality of trials a status of either available or not available is assigned to the second component.
 40. The non-transitory computer-readable storage medium of claim 34, wherein the instructions that cause the computer system to assign the status of either available or not available to the second component further include instructions that cause the computer system to assign the status of either available or not available to the second component based at least in part on instance reliability data.
 41. The non-transitory computer-readable storage medium of claim 40, wherein the instructions that cause the computer system to assign the status of either available or not available to the second component further include instructions that cause the computer system to assign the status of either available or not available to the second component based at least in part on class reliability data.
 42. A system, comprising: one or more processors; and memory to store computer-executable instructions that, if executed, cause the one or more processors to: receive an application definition for a set of components executing the application; determine a first subset of components of the set of components associated with a second subset of components of the set of components; generate a conditional probability table for the first subset of components and the second subset of components based at least in part on an availability of a first component of the first subset of components associated with a second component of the second subset of components, where the first component relies on the second component during execution of the application; and calculate a reliability estimate for the application based at least in part on the conditional probability table, wherein calculating the reliability estimate for the application comprises performing a plurality of trials, wherein performing the plurality of trials comprises assigning a status of either available or not available to at least a subset of the set of components in a graph, the assigned status being based at least in part on the conditional probability table and the assigned status of one or more directly relied upon components.
 43. The system of claim 42, wherein the memory further includes instructions that, if executed, cause the one or more processors to generate the graph comprising the set of components based at least in part on the application definition and the conditional probability table.
 44. The system of claim 43, wherein the memory further includes instructions that, if executed, cause the one or more processors to generate a plurality of availability configurations based at least in part on the graph indicating a status of available or not available for at least the portion of the set of components included in the graph.
 45. The system of claim 42, wherein the memory further includes instructions that, if executed, cause the one or more processors to sample the plurality of availability configurations based at least in part on a sampling algorithm.
 46. The system of claim 43, wherein the computer-executable instructions that cause the one or more processors to generate the graph further include computer-executable instructions that, if executed, cause the one or more processors to generate the graph further based at least in part on tag information associated with each component of the set of components.
 47. The system of claim 42, wherein the computer-executable instructions that cause the one or more processors to calculate the reliability estimate further include computer-executable instructions that, if executed, cause the one or more processors to: calculate a first value indicating a first number of trials of the plurality of trials in which the application is assigned a status of available; calculate a second value indicating a second number of trials of the plurality of trials in which the application is assigned a status of not available; and compare the first value and the second value.
 48. The system of claim 42, wherein the memory further includes instructions that, if executed, cause the one or more processors to obtain reliability metrics data associated with one or more components of the set of components comprising at least one of user reliability data, instance reliability data, and class reliability data.
 49. The system of claim 48, wherein the computer-executable instructions that cause the one or more processors to calculate the reliability estimate further include computer-executable instructions that, if executed, cause the one or more processors to calculate the reliability estimate based at least in part on the reliability metrics data and the conditional probability table.